Business Development Prediction - Decision Tree - Regression
Demo
Live Web App Demo Link
Deployment on Heroku: https://mlbusinessdevelopmentapp.herokuapp.com/
Deployment on Streamlit: https://share.streamlit.io/monicadesai-tech/project_79/main/app.py
Abstract
The purpose of this report is to use the BlackFriday.csv dataset to predict product sales from details provided by the user.
This can be used to gain insight into why sales are higher or lower at a given time, depending on factors such as the city and the occupation of its residents. It can also provide a marketing advantage, by targeting advertisements at those who are more or less likely to purchase products on sale given their other spending. Sales prediction is a regression problem: the model takes various parameters as input from the user and returns a result. This is an exploration of Business Development Prediction through machine learning concepts. An end-to-end project means a step-by-step process: it starts with data collection and EDA, moves through data preparation (cleaning and transformation), then selecting, training and saving ML models, cross-validation and hyper-parameter tuning, and finishes with developing a web service and deploying it so end users can use it anytime and anywhere.
This repository contains the code for Business Development Prediction using several Python libraries: numpy, pandas, matplotlib, seaborn, sklearn, streamlit and pickle. Each library provides one particular piece of functionality. NumPy (Numerical Python) is used for working with arrays, and pandas objects rely heavily on NumPy objects. Matplotlib is a plotting library, and seaborn is a data-visualization library built on top of matplotlib. Scikit-learn provides a large collection of machine-learning models. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream. Streamlit is an open-source Python library that makes it easy to create and share custom web apps for machine learning. The purpose of creating this repository is to gain insight into a complete ML project; working through these libraries deepened my practical knowledge of them and has grown my ML repository. The screenshots above and the video in the Video_File folder will help you understand the flow of the output.
Motivation
The reason for building this project is that I have worked in this domain and know the strategies applied to increase brand awareness, the management required to work not only with one department but with various levels of people, and the balancing of the supply-demand ratio along with CRM. Identifying prospects' needs according to their design requirements, with timely delivery, is one of the more challenging tasks, as is creating report analyses to gain insight into revenue and cost management. Since I also have knowledge of Python, I decided to combine the two and create a Business Development project as a whole, to understand the working of the model from an IT perspective. Hence, I continue to spread my tech wings in the IT heaven.
The Data
It shows there are 537,577 total observations/rows and 12 columns in the given dataset.
Here are all the features included in the data set, and a short description of them all.
It has 7 numeric columns out of 12.
It displays the number of missing/null values in each column.
It indicates the name of each column, the unique categories in a selected column, and the number of occurrences of each of those categories in that column.
Analysis of Data
Basic Statistics
Graphing of Features
Graph Set 1
Graph Set 2
Graph Set 3
Modelling
The purpose of these models is to get effective insight into the following:
1. How sales of products depend on various factors:
• This insight can be used for market targeting.
2. How changes in the RMSE of the predictions affect decisions:
• Spending more money to target the potential and loyal customers who are most likely to purchase a product, or spending less money on services and offers for a particular age category, and identifying the reasons responsible for lower sales.
Math behind the metrics
In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.
The parameters it takes:
1) y_true: 1d array-like, or label indicator array / sparse matrix. Ground truth (correct) labels.
2) y_pred: 1d array-like, or label indicator array / sparse matrix. Predicted labels, as returned by a classifier.
3) normalize: bool, optional (default=True). If False, return the number of correctly classified samples; otherwise, return the fraction of correctly classified samples.
4) sample_weight: array-like of shape [n_samples], optional. Sample weights.
In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value. MSE is a risk function, corresponding to the expected value of the squared error loss. The fact that MSE is almost always strictly positive (and not zero) is because of randomness or because the estimator does not account for information that could produce a more accurate estimate.
The MSE is a measure of the quality of an estimator—it is always non-negative, and values closer to zero are better.
When using a tree-based algorithm such as random forest to solve regression problems, the mean squared error (MSE) is used to measure how your data branches from each node.
This formula calculates the distance of each predicted value from the actual value, helping to decide which branch is the better decision for your forest. Here, yi is the value of the data point you are testing at a certain node and fi is the value returned by the decision tree.
R-squared (the coefficient of determination) represents how well the predicted values fit the original values. It ranges from 0 to 1 and can be interpreted as a percentage; the higher the value, the better the model.
RMSE (Root Mean Squared Error) is the square root of the MSE, which expresses the error in the same units as the target.
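In symbols, with yi the actual values, ŷi the predictions and ȳ their mean, the three metrics described above are:

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```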
Decision Tree Regression Formula:
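The formula image is not reproduced in this text; in its standard form, a regression tree that partitions the feature space into regions R1, …, RM predicts the mean of the training targets falling in each region:

```latex
f(x) = \sum_{m=1}^{M} c_m \,\mathbf{1}\!\left(x \in R_m\right),
\qquad
c_m = \operatorname{mean}\left(y_i \mid x_i \in R_m\right)
```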
Example 1:
Linear Regression Formula:
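The formula image is not reproduced in this text; the standard multiple linear regression form, with coefficients β and error term ε, is:

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon
```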
Algorithm for Linear Regression:
Algorithm for Decision Trees:
Reference Link
Model Architecture Process Through Visualization
Quick Notes
Step 1: Imported essential libraries.
Step 2: Loaded and read the data.
Step 3: Analysed the data using the ‘.info(), .value_counts(), .unique(), .isnull().sum()’ commands and imputed missing values.
Step 4: Performed data pre-processing through cleaning and saved new cleaned data file.
Step 5: Performed Exploratory Data Analysis (EDA) on the data.
Step 6: Split the dataset into train and test sets in order to make predictions w.r.t. X_test.
Step 7: Performed model building, prediction and evaluation. Imported the Linear Regression model and fitted the data to it using the ‘.fit()’ function, then used the ‘.predict()’ function to generate predictions, and finally evaluated the model.
Step 8: Saved the model as pickle file to re-use.
Step 9: Performed model building, prediction and evaluation. Imported the Decision Tree Regressor model and fitted the data to it using the ‘.fit()’ function, then used the ‘.predict()’ function to generate predictions, and finally evaluated the model.
Step 10: Saved the model as pickle file to re-use.
Step 11: Loaded Decision Tree Model which was created.
Step 12: Created a word cloud wishing the respective occasion in a selected language.
Step 13: Created Web App for end-users.
The Model Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns Imported essential libraries - When we import modules, we can call functions that are not built into Python. Some modules are installed as part of Python, and some we install through pip. Modules make our programs more robust and powerful, as we are leveraging existing code.
df = pd.read_csv("BlackFriday.csv")
df.head() Loaded and read the data - You can import tabular data from a CSV file into a pandas DataFrame by passing the file path as a parameter; the BlackFriday.csv file is imported here.
df.info()
df.describe()
df.isnull().sum()
df.columns
df['City_Category'].unique()
df['City_Category'].value_counts() Analysed the data - Using the ‘.info(), .value_counts(), .unique(), .isnull().sum()’ commands, imputing missing values, and checking the number of unique categories in each column.
cc_df = df['City_Category'].value_counts().to_frame()
cc_df = cc_df.reset_index()
cc_df Performed data pre-processing – Pre-processing is crucial in any data-mining process, as it directly impacts the success rate of the project. Data is said to be unclean if it is missing attributes or attribute values, or contains noise, outliers, or duplicate or wrong data; the presence of any of these degrades the quality of the results. Missing values are checked and dealt with by replacing them with zero or the mean value, data types are converted using ‘.astype()’, and unwanted signs are replaced using the ‘.replace()’ function. fit_transform() joins two steps and is used for the initial fitting of parameters on the training set x, while also returning the transformed x′; internally, the transformer object simply calls fit() first and then transform() on the same data.
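As a small sketch of the fit()/transform()/fit_transform() relationship described above (toy data, not the Black Friday set):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])

# One-step version: fit the parameters and transform in a single call
Xa = StandardScaler().fit_transform(X_train)

# Two-step version: learn the mean/std first, then apply them
scaler = StandardScaler()
scaler.fit(X_train)             # estimates mean and standard deviation
Xb = scaler.transform(X_train)  # applies them to produce x'

# Both paths yield the same scaled data
assert np.allclose(Xa, Xb)
```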
sns.countplot(df['City_Category'])
df.columns
df['Gender'].value_counts()
df['Occupation'].value_counts()
df['Age'].value_counts()
df['Purchase']
df.columns
df['Marital_Status'].unique()
def count_plot(dataframe, column_name, title=None, hue=None):
    '''
    Function to plot a seaborn count plot
    Input: the dataframe to be plotted, the column_name to be plotted, a title for the graph
    Output: the data plotted as a count plot
    '''
    base_color = sns.color_palette()[0]
    sns.countplot(data=dataframe, x=column_name, hue=hue)
    plt.title(title)
count_plot(df, 'Age', hue='Gender')
df.head()
df.groupby(['Gender','Marital_Status']).size().to_frame()
mr_gender = df.groupby(['Gender','Marital_Status']).size().to_frame()
marital_df =mr_gender.reset_index()
marital_df.columns
marital_df.rename(columns={0:'Counts'},inplace=True)
marital_df
df.shape
df['Age'].unique()
age_map = {"Children":"0-17","Young Adult":"18-25","Young Adult(Prime)":"26-35","Middle Adult":"36-45","Late Adult":"46-50","Early Old Age":"55+"}
df.head()
df['Product_ID'].unique().tolist()
def make_dict(col):
    d = {v: k for k, v in enumerate(col.unique())}
    return d
age_dict = make_dict(df['Age'])
age_dict
age_dict = {'0-17': 1,
'55+': 7,
'26-35': 3,
'46-50': 5,
'51-55': 6,
'36-45': 4,
'18-25': 2}
age_dict
city_dict = make_dict(df['City_Category'])
city_dict = {'A': 0, 'C': 2, 'B': 1}
df['Stay_In_Current_City_Years'].unique()
df['Stay_In_Current_City_Years'].replace('4+',4,inplace=True)
df['Occupation'].unique()
df['Marital_Status'].unique()
df['Product_Category_1'].isnull().sum()
df['Product_Category_2'].isnull().sum()
df['Product_Category_2'].unique()
df['Product_Category_2'].fillna(df['Product_Category_2'].value_counts().idxmax(),inplace=True)
df['Product_Category_3'].fillna(df['Product_Category_3'].value_counts().idxmax(),inplace=True)
df.isnull().sum()
df.dtypes
df_clean = df
df_clean.to_csv("Black_Friday_No_Missing_Value.csv")
df['Gender'] = df['Gender'].map({"F":0,"M":1})
df['Age'] = df['Age'].map(age_dict)
age_dict
df['City_Category'] = df['City_Category'].map(city_dict)
df.dtypes
df['Stay_In_Current_City_Years'] = df['Stay_In_Current_City_Years'].astype(int)
df.dtypes
df.to_csv("Black_Friday_Data_Encoded.csv")
df.columns
df2 = df[['Gender', 'Age', 'Occupation', 'City_Category',
'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1',
'Product_Category_2', 'Product_Category_3', 'Purchase']]
sns.heatmap(df2.corr(),annot=True)
Xfeatures = df[['Gender', 'Age', 'Occupation', 'City_Category',
'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1',
'Product_Category_2', 'Product_Category_3']]
Xfeatures
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(Xfeatures)
Xfeatures.head()
X
Xfeatures.columns
X2 = pd.DataFrame(X,columns=Xfeatures.columns)
X2.head() Performed EDA – The primary goal of EDA is to maximize the analyst's insight into a data set and into its underlying structure, while providing the specific items an analyst would want to extract from it, such as a good-fitting, parsimonious model and a list of outliers. Many Python libraries, such as pandas, NumPy, matplotlib and seaborn, are available for EDA. The four types of EDA are univariate non-graphical, multivariate non-graphical, univariate graphical, and multivariate graphical. The relationships between columns were checked by plotting a correlation heatmap, which uses coloured cells, typically in a monochromatic scale, to show a 2D correlation matrix between two discrete dimensions. Correlation ranges from -1 to +1: values close to zero mean there is no linear trend between the two variables, while the closer the correlation is to 1, the more positively correlated they are; that is, as one increases so does the other, and the closer to 1, the stronger this relationship.
from sklearn.model_selection import train_test_split
y = df['Purchase']
X_train, X_test, y_train, y_test = train_test_split(
X2, y, test_size=0.2, random_state=42)
X_train2, X_test2, y_train2, y_test2 = train_test_split(
Xfeatures, y, test_size=0.2, random_state=42) Split the dataset - Splitting the data into train and test sets in order to make predictions w.r.t. X_test; this enables prediction after training. Used sklearn's ‘train_test_split()’ function.
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
lr_model2 = LinearRegression()
lr_model2.fit(X_train2,y_train2)
y_pred_lr2 = lr_model2.predict(X_test2)
print("Linear Regression: ")
print("RMSE:",np.sqrt(mean_squared_error(y_test2, y_pred_lr2)))
print("R2 score:", r2_score(y_test2, y_pred_lr2)) Performed model building, prediction and evaluation - The fit() method takes the training data as arguments: one array in the case of unsupervised learning, or two arrays in the case of supervised learning. Predicted the test data w.r.t. X_test, which helps in analysing the performance of the model. Then evaluated the model using the MSE, RMSE and R2 score error measures, which indicate how well the model is performing and are used to evaluate prediction error rates and model performance in regression analysis. Once you have obtained your error metrics, take note of which X's have minimal impact on y; removing some of these features may increase the accuracy of your model. RMSE is the most popular metric: similar to MSE, but the result is square-rooted to make it more interpretable, as it is in base units, and it is recommended as the primary metric for interpreting your model. All of these metrics require two arrays as parameters: the predicted values and the true values.
import joblib
lr_model2_file = open("lr2_bf_sales_model_23_oct.pkl","wb")
joblib.dump(lr_model2,lr_model2_file)
lr_model2_file.close() Saved the model – Saving the created Linear Regression model as a pickle file for future use, so that we can load it directly rather than go through the full cycle again.
from sklearn.tree import DecisionTreeRegressor
dtree2 = DecisionTreeRegressor()
dtree2.fit(X_train2,y_train2)
y_pred_dt2 = dtree2.predict(X_test2)
print("Decision Tree Regression: ")
print("RMSE:",np.sqrt(mean_squared_error(y_test2, y_pred_dt2)))
print("R2 score:", r2_score(y_test2, y_pred_dt2)) Performed model building, prediction and evaluation - Fitted the Decision Tree Regressor with ‘.fit()’, predicted the test data with ‘.predict()’, and evaluated it with the same RMSE and R2 measures described for the Linear Regression model above.
dt2_model_file = open("dt2_bf_sales_model_23_oct.pkl","wb")
joblib.dump(dtree2,dt2_model_file)
dt2_model_file.close() Saved the model – Saving the created Decision Tree Regression model as a pickle file for future use, so that we can load it directly rather than go through the full cycle again.
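The save-and-reload cycle used for both models can be sketched on toy data (the filename here is illustrative, not one of the repository's model files):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])
model = DecisionTreeRegressor().fit(X, y)

# Dump the fitted model and load it back without retraining
path = os.path.join(tempfile.gettempdir(), "dt_demo.pkl")
joblib.dump(model, path)
reloaded = joblib.load(path)

# The reloaded model makes identical predictions
assert np.allclose(model.predict(X), reloaded.predict(X))
```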
thanksgiving_day = {"English":"Happy Thanksgiving Day",
"French":"le Jour de Merci Donnant",
"Spanish":"el Día de Acción de Gracias",
"Portuguese": "O Dia de Acção de Graças",
"German":"Danksagung",
"Hebrew":"חג ההודיה שמח",
"Russian":"С Днем Благодарения день",
"Igbo":"Ekele ụbọchị",
"Yoruba":"ojó idupe",
"Hindi":"धन्यवाद दिवस की शुभकामनाएं",
"Arabic":"عيد شكر سعيد",
"Danish": "Helligdag",
"Twi":"Aseda Da",
"Dutch": "Gedenkdagen",
"Swedish": "Helgdag",
"Polish": "Dzień Dziękczynienia",
"Chinese": "感恩節日 [感恩节日] (gănēnjiérì)",
"Japanese":"感謝祭 (kanshasai)"}
happy ={
"English":"Happy Thanksgiving",
"French": "Action de grâce",
"Spanish":"Feliz Día de Acción de Gracias!",
"Portuguese": "Feliz (dia de) acção de graças",
"German": "Herzliche Danksagung",
"Hebrew":"חג הודיה שמח",
"Russian":"счастливого дня благодарения",
"Igbo":"ekele ekele",
"Yoruba":"idunnu idunnu",
"Hindi":"थैंक्सगिविंग की शुभकामनाएं",
"Arabic":"عيد شكر سعيد",
"Danish": "Glædelig Helligdag",
"Twi":"Aseda(Afehyiapa)",
"Dutch": "Vrolijke gedenkdagen",
"Swedish": "God Helgdag",
"Polish": "Święto dziękczynienia",
"Chinese":"感恩節快樂 [感恩节快乐] (gănēnjié kuàilè)",
"Japanese":"感謝祭おめでとう (kanshasai omedetō)"}
tdf = pd.DataFrame([thanksgiving_day,happy])
tdf.T
tdf.columns
tdf2 = pd.DataFrame(tdf.T)
tdf2.head()
tdf2 = tdf2.reset_index()
tdf2.columns = ["Language","Day","Word"]
tdf2.head()
tdf2.to_csv("thanksgiving_in_multi_lang_data.csv")
from wordcloud import WordCloud
day = " ".join(tdf2['Day'].tolist())
mywordcloud = WordCloud().generate(day)
plt.imshow(mywordcloud,interpolation='bilinear')
plt.show() Created word cloud – A tag cloud (word cloud, wordle, or weighted list in visual design) is a novelty visual representation of text data, typically used to depict keyword metadata (tags) on websites or to visualize free-form text. Tags are usually single words, and the importance of each tag is shown with font size or colour. Word clouds allow you to share research results in a way that does not require an understanding of the technicalities.
##eda_app.py
import streamlit as st
import pandas as pd
# Data Viz Pkgs
import matplotlib.pyplot as plt
import matplotlib
matplotlib.use('Agg')
import seaborn as sns
import plotly.express as px
@st.cache
def load_data(data):
    df = pd.read_csv(data)
    return df

def count_plot(dataframe, column_name, title=None, hue=None):
    '''
    Function to plot a seaborn count plot
    Input: the dataframe to be plotted, the column_name to be plotted, a title for the graph
    Output: the data plotted as a count plot
    '''
    base_color = sns.color_palette()[0]
    sns.countplot(data=dataframe, x=column_name, hue=hue)
    plt.title(title)
def run_eda():
    st.subheader("EDA")
    submenu = st.sidebar.selectbox("Submenu", ["EDA", "Plots"])
    df = load_data("data/BlackFriday.csv")
    if submenu == "EDA":
        st.subheader("Exploratory Data")
        st.dataframe(df.head())
        c1, c2 = st.beta_columns(2)
        with st.beta_expander("Descriptive Summary"):
            st.dataframe(df.describe())
        with c1:
            with st.beta_expander("Gender Distribution"):
                st.dataframe(df['Gender'].value_counts())
        with c2:
            with st.beta_expander("Age Distribution"):
                st.dataframe(df['Age'].value_counts())
    elif submenu == "Plots":
        st.subheader("Plotting")
        col1, col2 = st.beta_columns(2)
        with col1:
            with st.beta_expander("Pie Chart (Gender)"):
                gen_df = df['Gender'].value_counts().to_frame()
                gen_df = gen_df.reset_index()
                gen_df.columns = ['Gender Type', 'Counts']
                # st.dataframe(gen_df)
                p01 = px.pie(gen_df, names='Gender Type', values='Counts')
                st.plotly_chart(p01, use_container_width=True)
            with st.beta_expander("City"):
                city_df = df['City_Category'].value_counts().to_frame()
                city_df = city_df.reset_index()
                city_df.columns = ['Category', 'Counts']
                p01 = px.pie(city_df, names='Category', values='Counts')
                st.plotly_chart(p01, use_container_width=True)
        with col2:
            with st.beta_expander("Bar Chart(Gender)"):
                fig = plt.figure()
                sns.countplot(df['Gender'])
                st.pyplot(fig)
            with st.beta_expander("Plot of Occupation"):
                fig = plt.figure()
                sns.countplot(df['Occupation'])
                st.pyplot(fig)
            with st.beta_expander("Age"):
                age_df = df['Age'].value_counts().to_frame()
                age_df = age_df.reset_index()
                age_df.columns = ['Age Range', 'Counts']
                p01 = px.bar(age_df, x='Age Range', y='Counts')
                st.plotly_chart(p01, use_container_width=True)
            with st.beta_expander("Gender vs Marital Status"):
                marital_df = df.groupby(['Gender', 'Marital_Status']).size().to_frame().reset_index()
                marital_df.rename(columns={0: 'Counts'}, inplace=True)
                po2 = px.bar(marital_df, x='Marital_Status', y='Counts', color='Gender')
                st.plotly_chart(po2)
##home_page.py
import streamlit as st
import pandas as pd
# Data Viz Pkgs
import matplotlib.pyplot as plt
import matplotlib
matplotlib.use('Agg')
from wordcloud import WordCloud
@st.cache
def load_data(data):
    df = pd.read_csv(data)
    return df

def run_home_page():
    df = load_data("data/thanksgiving_in_multi_lang.csv")
    # st.dataframe(df)
    with st.beta_expander("Happy Thanksgiving Day", expanded=True):
        day_text = " ".join(df['Day'].tolist())
        mywordcloud = WordCloud().generate(day_text)
        fig = plt.figure()
        plt.imshow(mywordcloud, interpolation='bilinear')
        plt.axis('off')
        st.pyplot(fig)
    lang_list = df['Language'].unique().tolist()
    lang_choice = st.sidebar.selectbox("Lang", lang_list)
    if lang_choice:
        thank_word = df[df["Language"] == lang_choice].iloc[0].Word
        thank_day = df[df["Language"] == lang_choice].iloc[0].Day
        st.info("How to Say Happy Thanksgiving in {}".format(lang_choice))
        st.write({"lang": lang_choice, "word": thank_word, "day": thank_day})
    name = st.text_input("Name", "Streamlit")
    bgcolor = st.beta_color_picker("")
    modified_name = "From {0} {0} {0}".format(name)
    updated_text = []
    updated_text.append(modified_name)
    updated_text.extend(df['Word'].tolist())
    # st.write(updated_text)
    new_text = " ".join(updated_text)
    with st.beta_expander("Thanksgiving From {}".format(name)):
        mywordcloud = WordCloud(background_color=bgcolor).generate(new_text)
        fig = plt.figure()
        plt.imshow(mywordcloud, interpolation='bilinear')
        plt.axis('off')
        st.pyplot(fig)
##ml_app.py
# Core Pkgs
import streamlit as st
# Utils
import numpy as np
import joblib
import os
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
def scale_data(x):
    X = scaler.fit_transform(x)
    return X

age_dict = {'0-17': 1, '55+': 7, '26-35': 3, '46-50': 5, '51-55': 6, '36-45': 4, '18-25': 2}
gender_dict = {"Female": 0, "Male": 1}
marital_status_dict = {"Single": 0, "Married": 1}
city_dict = {'A': 0, 'B': 1, 'C': 2}

def get_value(val, my_dict):
    for key, value in my_dict.items():
        if val == key:
            return value

# Load ML Models
@st.cache
def load_model(model_file):
    loaded_model = joblib.load(open(os.path.join(model_file), "rb"))
    return loaded_model
def run_ml():
    st.subheader("Black Friday Sales Predictor")
    col1, col2 = st.beta_columns(2)
    with col1:
        gender = st.radio("Gender", ("Female", "Male"))
        age = st.number_input("Age", 1, 75)
        occupation = st.number_input("Occupation", 1, 20)
        city_category = st.selectbox("City Category", ["A", "B", "C"])
        stay_in_current_city = st.number_input("No of Years of Stay in Current City", 1, 10)
    with col2:
        marital_status = st.radio("Marital Status", ("Single", "Married"))
        product_category_1 = st.number_input("Product 1", 1, 20)
        product_category_2 = st.number_input("Product 2", 1, 20)
        product_category_3 = st.number_input("Product 3", 1, 20)
    selected_options = {'Gender': gender, 'Age': age, 'Occupation': occupation, 'City_Category': city_category,
                        'Stay_In_Current_City_Years': stay_in_current_city, 'Marital_Status': marital_status,
                        'Product_Category_1': product_category_1, 'Product_Category_2': product_category_2,
                        'Product_Category_3': product_category_3}
    gender_en = get_value(gender, gender_dict)
    city_category_en = get_value(city_category, city_dict)
    marital_status_en = get_value(marital_status, marital_status_dict)
    single_sample = [gender_en, age, occupation, city_category_en, stay_in_current_city,
                     marital_status_en, product_category_1, product_category_2, product_category_3]
    # st.write(single_sample)
    st.write(selected_options)
    if st.button("Predict"):
        # scaled_sample = scale_data(np.array(single_sample).reshape(1,-1))
        # st.write(scaled_sample)
        sample = np.array(single_sample).reshape(1, -1)
        model = load_model("models/lr2_bf_sales_model_23_oct.pkl")
        prediction = model.predict(sample)
        st.info("Predicted Purchase")
        st.write("Purchased:${}".format(prediction[0]))
        st.balloons()
##app.py
import streamlit as st
import streamlit.components.v1 as stc
from home_page import run_home_page
from eda_app import run_eda
from ml_app import run_ml
html_temp = """
Black Friday Sales App
Happy Thanksgiving
"""
def main():
    stc.html(html_temp)
    menu = ["Home", "EDA", "ML", "About"]
    choice = st.sidebar.selectbox("Menu", menu)
    if choice == "Home":
        run_home_page()
    elif choice == "EDA":
        run_eda()
    elif choice == "ML":
        run_ml()
    else:
        st.subheader("About")
        st.info("Built with Streamlit")
        st.text("India - Monica Desai")
        st.text("Data Science Project")

if __name__ == '__main__':
    main() Created Web App – Built the web app in Streamlit for end-users.
Linear Regression – Base Model
1)Model Training
2)Predictions
3)Model Evaluations
Decision Tree Regression – First Model
1)Model Training
2)Predictions
3)Model Evaluations
Overall Model Analysis
A challenge I faced was that the relationships between the variables were not very clear, which limited the accuracy achievable while building the model.
The reason for selecting regression trees is that they are used when the dependent variable is continuous. For regression trees, the value of a terminal node is the mean of the observations falling in that region; therefore, if an unseen data point falls in that region, we predict using the mean value. Decision tree regression is both a non-linear and a non-continuous model.
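A minimal sketch of the terminal-node-mean behaviour described above, on toy data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [10.0], [11.0]])
y = np.array([1.0, 3.0, 20.0, 22.0])

# A depth-1 tree makes a single MSE-minimising split; each terminal
# node then predicts the mean of the observations in its region
tree = DecisionTreeRegressor(max_depth=1).fit(X, y)
left_pred = tree.predict(np.array([[1.5]]))[0]    # mean of {1, 3} -> 2.0
right_pred = tree.predict(np.array([[10.5]]))[0]  # mean of {20, 22} -> 21.0
```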
Creation of App
Here, I created the Streamlit app. I created a function to display a graph based on a user-selected variable. In home_page, I created a word cloud taking a language as input. In ml_app, the data is scaled and the saved model is loaded, then a few input parameters are taken from the user and the result is provided based on them. Finally, the respective HTML is rendered for the solution.
Technical Aspect
In multilabel classification, ‘accuracy_score()’ function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.
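For example, with two multilabel samples where only the second sample's label set matches exactly:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Subset accuracy: a sample counts as correct only if its entire
# label set matches y_true exactly
y_true = np.array([[0, 1], [1, 1]])
y_pred = np.ones((2, 2))

frac = accuracy_score(y_true, y_pred)                    # fraction correct
count = accuracy_score(y_true, y_pred, normalize=False)  # raw count
```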
MSE: Similar to MAE but noise is exaggerated and larger errors are “punished”. It is harder to interpret than MAE as it’s not in base units, however, it is generally more popular.
RMSE (Root Mean Squared Error) is the square root of the MSE, which expresses the error in the same units as the target.
R-squared (the coefficient of determination) represents how well the predicted values fit the original values. It ranges from 0 to 1 and can be interpreted as a percentage; the higher the value, the better the model.
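These three metrics, as used in the evaluation steps above, can be computed on a toy example:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])

mse = mean_squared_error(y_true, y_pred)  # average squared error
rmse = np.sqrt(mse)                       # back in the target's units
r2 = r2_score(y_true, y_pred)             # 1 - SS_res / SS_tot
```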
NumPy is used for working with arrays. It also has functions for working in the domains of linear algebra, Fourier transforms and matrices. It contains multi-dimensional array and matrix data structures and can be used to perform a number of mathematical operations on arrays.
The pandas module mainly works with tabular data through its DataFrame and Series structures. Element-wise loops over pandas objects can be considerably slower than the equivalent NumPy array operations, but pandas is a game changer when it comes to cleaning, transforming, manipulating and analyzing data.
Matplotlib is used for EDA. Visualization of graphs helps to understand data in better way than numbers in table format. Matplotlib is mainly deployed for basic plotting. It consists of bars, pies, lines, scatter plots and so on. Inline command display visualization inline within frontends like in Jupyter Notebook, directly below the code cell that produced it.
Seaborn provides a high-level interface for drawing attractive and informative statistical graphics. It provides a variety of visualization patterns and visualize random distributions.
Sklearn is known as scikit learn. It provides many ML libraries and algorithms for it. It provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python.
Streamlit is an open-source Python library that makes it easy to create and share beautiful, custom web apps for machine learning and data science. In just a few minutes you can build and deploy powerful data apps. It is a very easy library to create a perfect dashboard by spending a little amount of time. It also comes with the inbuilt webserver and lets you deploy in the docker container. When you run the app, the localhost server will open in your browser automatically.
‘StandardScaler()’ removes the mean and scales each feature/variable to unit variance. This operation is performed feature-wise in an independent way. ‘StandardScaler()’ can be influenced by outliers (if they exist in the dataset) since it involves the estimation of the empirical mean and standard deviation of each feature.
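The mean-removal and unit-variance scaling described above can be verified directly on toy data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# After scaling, every feature column has mean 0 and unit variance
Xs = StandardScaler().fit_transform(X)
```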
The need for train_test_split - Using the same dataset for both training and testing leaves room for miscalculation and increases the chances of inaccurate predictions. The train_test_split function lets you split a dataset with ease while pursuing an ideal model. Also keep in mind that your model should be neither overfitting nor underfitting.
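A minimal sketch of the split (toy data; the project uses test_size=0.2 and random_state=42, as shown earlier):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; random_state makes the
# shuffle reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```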
Installation
Using an Intel Core i5 9th-generation processor with an NVIDIA GeForce GTX 1650.
Windows 10 Environment Used.
Already Installed Anaconda Navigator for Python 3.x
The Code is written in Python 3.8.
If you don't have Python installed you can install Python from its official site.
If you are using a lower version of Python you can upgrade using pip; to ensure you have the latest version of pip, run python -m pip install --upgrade pip and press Enter.
Run-How to Use-Steps
Keep your internet connection on while running or accessing files and throughout too.
Follow this when you want to perform from scratch.
Open Anaconda Prompt, Perform the following steps:
cd
pip install matplotlib
pip install seaborn
pip install numpy
pip install streamlit
Note: If it shows an error such as ‘No module named …’, install the relevant module.
You can also create a requirements.txt file with pip freeze > requirements.txt
Create Virtual Environment:
conda create -n bd python=3.6
y
conda activate bd
cd
run .py or .ipynb files.
Paste the URL into the browser to check whether it is working locally.
Follow this when you want to just perform on local machine.
Download ZIP File.
Right-Click on ZIP file in download section and select Extract file option, which will unzip file.
Move unzip folder to desired folder/location be it D drive or desktop etc.
Open Anaconda Prompt, write cd
eg: cd C:\Users\Monica\Desktop\Projects\Python Projects 1\ 23)End_To_End_Projects\Project_11_ML_FileUse_End_To_End_Business_Development_App\ Project_ML_BuDev
conda create -n bd python=3.6
y
conda activate bd
In Anaconda Prompt, run pip install -r requirements.txt to install all packages.
In Anaconda Prompt, write streamlit run app.py and press Enter.
Paste the URL into the browser (if it does not open automatically) to check whether it is working locally.
Please be careful with spellings and numbers while typing a filename; it is easier to simply copy the filename and then run it, to avoid silly errors.
Note: cd
[Go to the folder where the file is. Select the path from the top bar, right-click, choose the copy option, and paste it after cd with one space.]
Directory Tree-Structure of Project
To Do-Future Scope
Can deploy on AWS and Google Cloud.
Technologies Used-System Requirements-Tech Stack
Conclusion
Modeling
Stacking of models should be attempted to see its effect.
Analysis
Credits
JCharis and J-Tech Security Channel
Paper Citation
Paper Citation Link1 here
Paper Citation Link2 here